Introduction

In the previous assignment, we had sourced, cleaned, and normalized dataset GSE77938. This dataset explores the factors contributing to keratoconus (KTCN). Keratoconus is a disorder of the eye that results in progressive thinning of the cornea, which may result in blurry vision, double vision, nearsightedness, astigmatism, and light sensitivity. This study performed comprehensive transcriptome profiling of human KTCN corneas using an RNA-Seq approach, the discovery analysis comparing eight KTCN (test condition) and eight non-KTCN (control) corneas (Kabza et al. 2017).

Here, we aim to determine pathways associated with significantly up- and down-reguated genes in KTCN corneas. First, we load the normalized data.(It is necessary to run expr_cleanup.Rmd first.)

normalized_counts <- readRDS("ncounts.rds")
head(normalized_counts)
##               KR_38      KR_50     KR_51      KR_53     KR_56       KR_57
## TSPAN6   17.6847346  17.931933 21.419169 19.8493200 18.199512  45.1612115
## DPM1     33.8829263  41.168775 33.111375 47.1734102 38.261798  42.2485340
## SCYL3    46.0008141  39.563929 41.151007 49.5086244 53.934790  49.5253243
## C1orf112  7.9248115   8.548038  8.248067  7.7770970  7.247688   6.7962475
## FGR       0.5843651   1.292793  1.806436  0.1772261  6.444768   0.4805428
## CFH      45.4369530 190.051740 61.994498 18.3793861 98.288072 136.2289662
##                KR_61      KR_64     KC_35       KC_37     KC_39      KC_40
## TSPAN6    23.2198029 34.6554399 50.601788 46.56161510 20.369263 16.8256110
## DPM1      45.2930919 49.3238362 38.957243 47.17814772 43.513469 37.4793308
## SCYL3     47.8640626 56.2308579 49.530906 54.17445258 50.787363 57.3425856
## C1orf112  11.2682635 10.4567977  8.119991  7.63964324  6.329232  9.1806857
## FGR        0.3937523  0.3730273  0.124763  0.02680577  0.129891  0.2484318
## CFH      102.6303709 46.7848439 52.223707 43.92571480 24.443116  9.1806857
##                KC_43      KC_45      KC_47       KC_52
## TSPAN6   33.65623349 20.6153992 32.9678451 34.22118135
## DPM1     39.73688496 38.2531392 41.7541895 41.61720101
## SCYL3    48.59923333 51.2505126 48.1561356 50.95507375
## C1orf112  6.21858686  7.4550152  7.1967456  6.41978751
## FGR       0.01149462  0.2390821  0.1415396  0.09550097
## CFH      49.69122178 23.8647425 38.6294301 49.07688798

Differential Expression

First, we conduct differential expression analysis with this normalized expression set. Based on previously generated MDS, the conditions themselves seem to cluster more than any other factors; and indeed no other factors are noted in the dataset. As such, condition is the only factor we will consider in our model.

samples <- data.frame(
        lapply(colnames(normalized_counts), 
        FUN=function(x){
          unlist(strsplit(x, split = "_"))[1]}))
colnames(samples) <- colnames(normalized_counts)
rownames(samples) <- c("cell_type")
samples <- data.frame(t(samples))
head(samples)

We then create a design matrix.

model_design <- model.matrix(~ samples$cell_type)
head(model_design)
##   (Intercept) samples$cell_typeKR
## 1           1                   1
## 2           1                   1
## 3           1                   1
## 4           1                   1
## 5           1                   1
## 6           1                   1

This is inverted from the ‘positive’ and ‘negative’ conditions in the paper (KC is the set of interest (KTCN), KR is the control (non-KTCN).) Thus we reverse it before using the model.

model_design[,2] <- abs(model_design[,2]-1)
head(model_design)
##   (Intercept) samples$cell_typeKR
## 1           1                   0
## 2           1                   0
## 3           1                   0
## 4           1                   0
## 5           1                   0
## 6           1                   0

The column name is still “celltypeKR”, but that’s alright (has no bearing on the analysis).

Create data matrix and fit the data to the above model.

expressionMatrix <- as.matrix(normalized_counts)
rownames(expressionMatrix) <- rownames(normalized_counts)
colnames(expressionMatrix) <- colnames(normalized_counts)
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)

Apply empircal Bayes to compute differential expression for the above described model. Use trend=TRUE as this is RNASeq Data. We used the Benjamini-Hochberg method for mutiple hypothesis correction. This was chosen because BH is specifically optimized to correct for false discovery rate. This is of particular importance in clinical samples, as a falsely enriched gene might point towards a non-existent potential drug target for the disease. However, given the sample size which isn’t too large, Bonferroni and other methods would be too strict, not allowing potentially ‘true’ discoveries to make it through the correction filter.

fit2 <- eBayes(fit,trend=TRUE)

#edgeR -- 

topfit <- topTable(fit2, 
                   coef=ncol(model_design),
                   adjust.method = "BH",
                   number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(rownames(normalized_counts),
                     topfit,
                     by.y=0,by.x=1,
                     all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
head(output_hits)

How many gene pass the threshold p-value < 0.05? How many pass the correction?

length(which(output_hits$P.Value < 0.05))
## [1] 4654
length(which(output_hits$adj.P.Val < 0.05))
## [1] 1001

We can show the overall Volcano plot below. The top gene highlighted is BPTF, a transcription factor (the role of transcription factors in KTCN being discussed later in this report).

We can also visualize the top hits using a heatmap. As we can see below, the conditions do cluster together. This is reassuring, as it shows that the expression differences in our top hits are truly associated with the disease phenotype.

Thresholded over-representation analysis

With our significantly up-regulated and down-regulated set of genes, we can run a thresholded gene set enrichment analysis. A positive log Fold Change indicates an upregulation in the disease state, and a negative a downregulation in the disease state. Here we can see that we have 2562 upregulated genes, and 2092 downregulated genes.

length(which(output_hits$P.Value < 0.05 & output_hits$logFC > 0))
## [1] 2092
length(which(output_hits$P.Value < 0.05 & output_hits$logFC < 0))
## [1] 2562

Thresholded list:

output_hits_withgn <- merge(rownames(normalized_counts),output_hits)
output_hits_withgn[,"rank"] <- -log(output_hits_withgn$P.Value,base =10) * sign(output_hits_withgn$logFC)
output_hits_withgn <- output_hits_withgn[order(output_hits_withgn$rank),]
upregulated_genes <- output_hits_withgn$x[
  which(output_hits_withgn$P.Value < 0.05 
             & output_hits_withgn$logFC > 0)]
downregulated_genes <- output_hits_withgn$x[
  which(output_hits_withgn$P.Value < 0.05 
             & output_hits_withgn$logFC < 0)]
write.table(x=upregulated_genes,
            file="upregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
            file="downregulated_genes.txt",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
all_genes <- output_hits_withgn$x

I used gProfiler for the thresholded enrichment analysis. This was because it is a common tool, drawing from multiple annotation sources mentioned in class (ex. KEGG, GO, Reactome) and it has a comfortable R interface. This draws the latest versions from each annotation source separately (as of March 3, 2020).

Both when the analysis is run using up-regulated and all genes, the top results are transcription factors (ordered gene lists are used). There does not seem to be a large difference in the type of transcription factors that are shown as enriched using each method. However, it should be noted that the original publication performed overrepresentation analysis on upregulated and downregulated genes separately, so analysing the pathways in this manner is consistent with the way the authors approached the problem.

enrichment <- gost(all_genes, ordered_query = TRUE)
significant_pathways <- enrichment$result$term_name[order(enrichment$result$p_value)]
head(significant_pathways)
## [1] "Factor: E2F; motif: GGCGSG; match class: 1"     
## [2] "Factor: ER81; motif: RCCGGAARYN; match class: 1"
## [3] "focal adhesion"                                 
## [4] "cell-substrate adherens junction"               
## [5] "anchoring junction"                             
## [6] "cell-substrate junction"

How many genes overall with a threshold of P < 0.05?

length(enrichment$result$term_name[which(enrichment$result$p_value < 0.05)])
## [1] 1041

We can also observe where most of these annotations came from and what categories they fall into (most are transcription factors, which is discussed below).

With only up-regulated genes:

enrichment_up <- gost(upregulated_genes, ordered_query = TRUE)
significant_pathways_up <- enrichment_up$result$term_name[order(enrichment_up$result$p_value)]
head(significant_pathways_up)
## [1] "Factor: ETF; motif: GVGGMGG; match class: 1"      
## [2] "Factor: SP4; motif: SCCCCGCCCCS"                  
## [3] "Factor: SP4; motif: SCCCCGCCCCS; match class: 0"  
## [4] "Factor: E2F-3; motif: GGCGGGN; match class: 1"    
## [5] "Factor: AP-2; motif: MKCCCSCNGGCG; match class: 1"
## [6] "Factor: Sp1; motif: NGGGGGCGGGGYN; match class: 1"

Only down-regulated genes:

enrichment_dwn <- gost(downregulated_genes)
significant_pathways_dwn <- enrichment_dwn$result$term_name[order(enrichment_dwn$result$p_value)]
head(significant_pathways_dwn)
## [1] "anchoring junction"                "extracellular matrix organization"
## [3] "cytoplasm"                         "biological adhesion"              
## [5] "adherens junction"                 "cytoplasmic part"

Interpretation

The main up-regulated factors are ETF, SP4, and other transcription factors: overall, what we see is an upregulation in DNA binding transcription factor actiity. Among these are genes associated with retinal diseases - such as E2F, which is known to be upregulated in retinoblasoma as it is associated with Cdk6 and the Ras pathway. This is consistent with the publication, which also specifically mentions upregulation of CREB phosphorylation through the activation of Ras. Other genes are also among those highly enriched, such as the GRIN gene family (ex. GRIN2B), which are also cited in the paper as upregulated, as part of the CREB phosphorylation and as part of the activation of the NMDA receptor upon glutamate binding and postsynaptic events.

The down-regulated pathways present a somewhat clearer and more coherent picture. The main down-regulated pathways are adherens-junction related: “anchoring junction”, “adherens junction”, “extracellular matrix organization” etc. This is supported in the paper as well, as extracellular matrix organization was for the authors also a main down-regulated pathway. The authors state that lower expression of nearly all genes involved in collagen maturation and the ECM stability was observed, and this is replicated in this analysis. This makes sense, as it is known that there are pronounced abnormalities in the extracellular matrix in keratoconus patients (Kenney et al. 1997).

References

Packages used:

limma Biobase ComplexHeatmap circlize knitr gprofiler2

Kabza, Michal, Justyna Karolak, Malgorzata Rydzanicz, Michal Szczesniak, Dorota Nowak, Barbara Ginter-Matuszewska, Piotr Polakowski, Rafal Ploski, Jacek Szaflik, and Marzena Gajecka. 2017. “Collagen Synthesis Disruption and Downregulation of Core Elements of Tgf-??, Hippo, and Wnt Pathways in Keratoconus Corneas.” European Journal of Human Genetics, no. 25: 582–90.

Kenney, MC, AB Nesburn, RE Burgeson, RJ Butkowski, and AV Ljubimov. 1997. “Abnormalities of the Extracellular Matrix in Keratoconus Corneas.” Cornea 16 (3): 582–90.